ABSTRACT

As a result of a recent College Board Admissions 
Testing Program Achievement Test scaling study, L. L. Cook and others 
recommended that the practice of sampling only high school juniors 
taking the achievement tests in June might be expanded to include 
sophomores and that a two-stage scaling procedure be evaluated. The 
two-stage procedure would include a first stage involving scaling 
tests on a within-cluster basis, but the second stage would involve 
taking the results of the first stage and following it with a second 
scaling in which scaled scores from the first stage would be used as 
input. This study experimentally evaluated these scaling 
recommendations, separating the Achievement Tests into a language 
test cluster and a non-language test cluster. In addition, high school
sophomores were added to the samples. Results indicate that addition 
of the sophomores does not improve the relationship between 
Achievement Test scores and the scaling covariates. There is evidence 
to suggest that the use of empirical values for the reference group 
and the use of a two-stage scaling procedure may improve the 
alignment of the Achievement Test scales. Seven tables and three
figures present analysis results. (Contains nine references.) 
(SLD)

EXECUTIVE SUMMARY

Cook, with the collaboration of Angoff and Schmitt, recently carried out
a College Board Admissions Testing Program (ATP) Achievement Test scaling 
study (see Cook, 1988). A goal of the Cook study was to provide several 
alternative scaling models for the Achievement Tests which would be based on 
the empirical evidence gathered in the study. It was intended that, in these
models, components such as scaling covariates, scaling samples, and
characteristics of the reference group would be varied. In addition, it was
anticipated that not all Achievement Tests studied would be amenable to
similar treatment; and most likely the tests would be clustered by content 
area and alternative models would be specified for each cluster. 

As a result of the analyses carried out by Cook, suggestions were made 
for constructing scaling samples as well as the reference population. For 
instance, it was suggested that the practice of sampling only high school 
juniors taking the Achievement Tests in June for scaling purposes might 
possibly be altered to also include high school sophomores. It was also 
suggested that a two-stage scaling procedure be evaluated. The two-stage
scaling procedure would include a first stage which would involve scaling
tests solely on a within-cluster basis; i.e., different covariates, different
reference populations, and possibly even different sampling procedures would
be used for each cluster. The second stage of the two-stage scaling procedure
would involve taking the results of the first-stage scaling procedure just
described and following it with a second scaling in which the scaled scores
obtained from the first-stage scalings would be used as input.

The purpose of this study was to experimentally evaluate the scaling
recommendations provided by Cook (1988). The Achievement Tests were separated
into two clusters: a language test cluster and a non-language test cluster.
Tests within the non-language test cluster were then subjected to a
first-stage scaling, using scores on SAT-V and SAT-M as covariates, where the means
and standard deviations of the reference population for SAT-V and SAT-M were
empirically determined using combined data from all the samples taking each of
the non-language tests. For the language tests, a similar procedure, but
making use of semesters of study in addition to SAT-V and SAT-M scores, was
implemented; again, reference group means and standard deviations for SAT-V,
SAT-M, and semesters of study were empirically determined from the combined
data from all samples taking each of the language tests.

The language tests were also subjected to an additional first-stage 
scaling procedure. This procedure can be thought of as an alternative to the
procedure specified above for the first-stage scaling of the language cluster. 
The procedure consisted of scaling all of the language tests to the French 
Test, using SAT-V and -M scaled scores and semesters of study as covariates. 

Scaled scores resulting from the one first-stage scaling applied to the
non-language tests and the two first-stage scalings of the language tests were
then subjected to second-stage scalings, using scores on SAT-V and SAT-M as
covariates, but using empirically derived reference group values based on data
from the combined sample derived from the individual samples for each language
and non-language Achievement Test included in the study.

The results of the experimental first- and second-stage scalings were
then compared to the results of two single-stage scaling procedures. One
procedure used empirically determined estimates for the reference group values
based on the combined sample of all examinees from each of the language and
non-language test samples, i.e., the same reference group used for the
second-stage scaling described above. The second procedure used 500 and 100 as
the reference population mean and standard deviation for SAT-V and SAT-M.
Hence, all that differed between the two single-stage scalings were the
reference group values used in the scaling equations.

Finally, in an effort to assess the effects of including high school 
sophomores in the scaling samples from the June administration, sophomores 
were included in the scaling samples for Achievement Tests in Biology and 
Chemistry for the first- and second-stage experimental scalings that were
carried out. 

The results of the study related to the sampling question indicate that 
addition of high school sophomores to the samples does not improve the 
relationship between Achievement Test scores and the scaling covariates (at 
least as evaluated by the correlations between Achievement Test score and the 
covariates) and thus is probably not an appropriate change to consider. The
results of the study related to the investigation of the two-stage scaling
procedure indicate there is some reasonable evidence to suggest that the use
of empirical values for the reference group and use of the procedure that
involves scaling the test in two stages may improve the alignment of the
Achievement Test scales. A viable alternative involves application of the
single-stage scaling procedure based on empirically determined estimates for
the reference group values to the non-language tests and the two-stage scaling
to the language tests. This combination of procedures appears to provide a
degree of alignment of the scales comparable to that provided by applying the
two-stage procedure to all tests, and should be somewhat easier to implement.

The results of this study should be considered tentative and further
research should be carried out. Procedures that involve modeling and
correcting for the selection bias present in the Achievement Test scores might
prove fruitful. In addition, use of non-linear scaling procedures that should
provide improved alignment for scores throughout the entire scaled score range
for the tests might also be desirable and should be further investigated.



Aligning Score Scales for Achievement Tests in Multiple Content Areas 



Linda L. Cook 
Daniel R. Eignor 
Elizabeth B. Burton 

INTRODUCTION 

The College Board Admissions Testing Program (ATP) offers two varieties
of tests: the Scholastic Aptitude Test (SAT) and the Achievement Tests. The
SAT is a test of general verbal and mathematical developed ability that all
examinees testing through ATP usually take. The Achievement Tests, on the
other hand, are a battery of fourteen subject matter tests (fourteen tests
when this study was done; a fifteenth, Modern Italian, was added in June 1990).
Examinees testing at a particular date may take one, two, or three of
the fourteen tests. Moreover, the examinee is allowed to choose which of the
tests he or she wants to take at the particular administration. Hence, the group
of examinees taking any one of the Achievement Tests is a self-selected group,
different from the self-selected group that may have chosen to take one of the
other tests. Usually, however, score users wish to compare the scores of the
groups of examinees who take the different Achievement Tests and, hence, some
method of scaling that aligns the scales of the various tests, so that scores
on the tests are reasonably comparable, is necessary.

The desired outcome of any procedure used to align the scales of the
Achievement Tests is fairly evident. According to Angoff (1968):

"The purpose of this scaling is to ensure that a candidate 
who chooses to compete with more able candidates is not 
put at a disadvantage; that is, that a candidate who is 
average in a highly selected group of candidates will earn 
a higher scaled score than a candidate who is average in a 
less able group." 

Procedures which may be used to scale the Achievement Tests to achieve
comparability across tests are discussed in the next section of the paper.
These procedures form an extension of a subset of an overall set of procedures
known as moderation procedures. Keeves (1988) has provided an excellent
overview of moderation procedures and Cooney (1975, 1976) has described
certain of these procedures in detail, along with their use in moderating
examination marks in Australia. According to Keeves (1988):

"Moderation is a procedure that was first employed at 
Oxford University to compare and equate levels of 
performance in the examinations conducted within the 
colleges of the university. The statistical procedures 
that have been developed to serve the purposes of equating 
levels of achievement on different examination papers have 
also come to be known as "moderation" . . . The function 
of moderation in this situation is to establish and 
maintain comparable standards between different 
examinations in the same subject area that are conducted 
on different syllabuses or in different settings. A 
further use of moderation occurs when a total score must 
be calculated from examination marks in different subject 
areas."

Keeves (1988) goes on to point out: 

"Howard (alias used by Sir Cyril Burt) (1958) identified 
the key requirements of moderation procedures. They are 
that candidates should not be disadvantaged by the marking 
pattern of examiners nor by the candidatures [examinee 
cohorts] with whom they compete. In practice these 
requirements demand that the same mark on different 
examinations should imply the same level of performance 
relative to a common population." 

As can be seen from the above quotes, the demands being made of
moderation procedures in Australia are somewhat greater than those that are
made with respect to ATP Achievement Tests. The expectation in Australia is
that moderation will account for: 1) differences between subjects in the
quality of students attracted; 2) differences between schools in the
characteristics of students attending them; 3) differences between graders,
both between and within schools, in the distributions of scores given; and 4)
differences between courses of instruction studied by students. With the ATP
Achievement Tests, scores are objectively determined by scanning
multiple-choice answer sheets and it is assumed that students taking the exams
have had suitable preparation. Further, with our public education system, it is
assumed, on average, that little self-selection of schools takes place.
Moderation is necessary with the ATP Achievement Tests simply to account for
differences between subjects in the quality of students attracted. Moderation
is frequently used in Australia even when there are no differences between
subjects in the quality of students attracted, i.e., when all subjects are
compulsory. The procedures are then used, for instance, to account for
differences between graders in subjectively assigned marks given to students.

Keeves (1988) discusses two general classes of moderation procedures that 
have been used in Australia: 1) those procedures for moderation that involve 
attributes of a common stimulus task to which groups of students are required 
to respond, and 2) those procedures used for moderation that are concerned 
with the attributes of the groups of students with respect to a larger
population of students. The first set of moderation procedures typically
involves the use of a moderator test taken by all students. According to
Cooney (1976): 



"With such a measure, scaling may then be achieved either
by a modification of Pearson's bivariate adjustment method
which leads to a linear transformation¹, or by the
equipercentile method. . . Both methods depend heavily on the use
of a moderator variable which correlates uniformly and highly with
the marks being scaled."

¹Keeves (1988) has presented simplified equations for bivariate adjustment,
given in complete form by Cooney (1975), that make the moderation process easier
to understand. If the joint distribution of the moderator test scores and the
achievement test scores is bivariate normal and the marginal distributions are
normal and if the moderator test has a significant correlation r with the
achievement test, then the moderated score for student j on achievement test i
(T_{ij}) may be expressed as

    T_{ij} = M_{Y_i} + r \frac{S_{Y_i}}{S_X} (X_j - M_X)

where M_{Y_i} is the mean score on achievement test i, X_j is the score of
student j on the moderator test, M_X is the mean score on the moderator test,
S_{Y_i} is the standard deviation of the achievement measure Y_i, and S_X is the
standard deviation of the moderator test X.
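
The bivariate adjustment given in the footnote can be illustrated with a short Python sketch. It assumes the footnote's equation is the ordinary linear regression of the achievement score on the moderator score, T_ij = M_Yi + r (S_Yi / S_X)(X_j - M_X). The function and variable names are hypothetical and the scores are invented for illustration.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / ((len(x) - 1) * stdev(x) * stdev(y))


def moderated_scores(achievement, moderator):
    """Apply T_ij = M_Yi + r * (S_Yi / S_X) * (X_j - M_X) to each student j.

    `achievement` holds the scores Y_i on one achievement test and
    `moderator` holds the same students' scores X on the moderator test.
    """
    m_y, m_x = mean(achievement), mean(moderator)
    s_y, s_x = stdev(achievement), stdev(moderator)
    r = pearson_r(moderator, achievement)
    return [m_y + r * (s_y / s_x) * (x_j - m_x) for x_j in moderator]


# Invented data: five students' moderator and achievement scores.
moderator = [40, 50, 60, 70, 80]
achievement = [52, 55, 63, 64, 76]
print([round(t, 1) for t in moderated_scores(achievement, moderator)])
```

Under this reading, the moderated scores keep the achievement test's own mean, and their spread shrinks as the correlation r falls, which is one way to see why the method needs a moderator that correlates highly and uniformly with the marks being scaled.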

Keeves (1988) points out that the traditional procedure used to scale the ATP
Achievement Tests, discussed in the next section of this paper, is an
extension of the moderator variable method that involves two moderator tests
(SAT-V and SAT-M) and, for the language tests, a third moderator variable,
semesters of study. Cooney (1975) points out that the bivariate adjustment
procedure is seen as adequate only if there exists a moderator variable or set
of variables which measure the attributes of performance in the courses of
study and correlate highly and uniformly with the scores on the various
subjects.

Two moderation procedures that make use of characteristics of examinees
are in use in Australia. The first usually involves the situation where
there is considerable overlap in the groups of students taking different
subjects. According to Keeves (1988):

"The most obvious achievement characteristic that is 
available for adjusting the level of performance of a 
student group to allow for differences in the quality of 
the candidatures is the performance of the students on the 
other subjects that they sat on the same occasion." 

The second moderation procedure employs the characteristics of a student group
given by the average performance of the group on a general ability test.
According to Keeves (1988):

"Superficially this procedure is similar to that 
associated with the use of a moderator test, and the 
similarity arises from the fact that for all examination 
subjects the correlation between the subject and the 
moderator variable is set at unity, to account for the 
differences in the magnitudes of the correlations which 
were seen to pose particular problems for that method. . . 
In practice this procedure. . . establishes a scholastic 
aptitude test scale on which the qualities of the 
candidatures of the different school and subject groups 
are measured."

To summarize, the scaling procedures discussed in the next section of
this paper form an extension of a subset of the total set of procedures called
moderation procedures that have been in use for some time in Australia; more
particularly, these procedures are an extension of the subset of procedures
that make use of a moderator test. The key to how effectively moderation
procedures perform in this context is how the moderator variables
correlate with the scores to be scaled: the moderator variables should
correlate highly and uniformly with the scores on the various subjects
(Cooney, 1975).



When the College Board Achievement Tests were introduced for the first 
time in 1942 for operational use in admissions, the tests were initially 
scaled in such a way that the mean for the group choosing to take each test 
was set at 500 and the standard deviation at 100. That is to say, the average
of each group of candidates taking its test was made to appear equal to the
average performance of each of the other groups of candidates taking their
tests; the same was true of their standard deviations. As a consequence of this
scaling design, the score a candidate received on a test was clearly dependent 
on, among other things, the ability level of the group of examinees who took 
the particular test. For example, a candidate would appear more able if he or
she took a test typically chosen by a group of less able examinees and would 
appear less able if he or she chose a test typically taken by high ability 
examinees. Any candidate who understood the design of the score scales, and 
wished to appear to be relatively knowledgeable in his/her field, could adopt 
the strategy of selecting the Achievement Test normally taken by the least 
able candidates. In order to remove this element of unfairness, a scaling
system was designed in the middle 1940s to adjust the scales for the several
Achievement Tests to reflect the level and dispersion of ability of the 
candidates taking each test. With this system, a test typically taken by a 
more able group of candidates was made to yield an average scaled score higher 
than 500, and a test typically taken by a less able group of candidates was 
made to yield an average score lower than 500. 
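
A minimal Python sketch of the original 1942 scale definition follows, assuming a simple linear conversion that sets each test-taking group's own mean to 500 and standard deviation to 100. The function name and raw scores are invented for illustration. The sketch shows why that definition hid ability differences between groups, which is the problem the later adjustment was meant to remove.

```python
from statistics import mean, stdev


def scale_within_group(raw_scores, target_mean=500.0, target_sd=100.0):
    """Linearly rescale one group's raw scores to the target mean and SD."""
    m, s = mean(raw_scores), stdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


# Invented raw scores for two self-selected groups of different ability.
less_able_group = [20, 25, 30, 35, 40]
more_able_group = [50, 55, 60, 65, 70]

# The average candidate in each group receives 500, so the scale hides the
# ability difference between the two groups.
print(scale_within_group(less_able_group)[2])  # 500.0
print(scale_within_group(more_able_group)[2])  # 500.0
```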

The operational scale definition initially adopted in the middle 1940s
to achieve this result was that the candidate of average ability, relative to
a hypothetical aggregate of all candidates taking the College Board tests,
would, in theory, earn a score of 500 regardless of the Achievement Test that
he or she chose to take; also, that the dispersion of scaled scores for this 
hypothetical population would be defined with a standard deviation of 100 (see 
Wilks, 1961). Thus, higher ability Achievement Test groups would 
automatically have higher means and, correspondingly, lower ability 
Achievement Test groups would have lower means. This definition was 
implemented by defining "general ability" as measured by the verbal and 
mathematical Scholastic Aptitude Tests (SAT-V and SAT-M), respectively; and
the degree to which the SAT-V and -M scores played a role in this operational 
definition was a direct function of the relevance of those tests for the 
particular Achievement Test in question, as measured by the correlation of the 
SAT scores with the scores on the particular Achievement Test. A further 
adjustment was later introduced into this system by adding semesters of study 
to the SAT-V and -M scores for scaling the foreign language tests. This 
adjustment was intended to account for the fact that some languages were 
typically studied for longer periods of time than were other languages. 
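
One way to picture the covariate-based definition is the Python sketch below. It is not the operational procedure, whose equations are not given in this section. It assumes a simplified regression projection: the Achievement Test mean and variance are estimated for the hypothetical reference population from the sample's regression on SAT-V and SAT-M, and the raw-to-scale line is then set so that this population would have a mean of 500 and a standard deviation of 100. All function names, parameters, and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def cov(a, b):
    """Sample covariance of two equal-length lists (the variance if a is b)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def scaling_line(ach, sat_v, sat_m, ref_mean, ref_cov):
    """Return (slope, intercept) of the raw-to-scale conversion.

    ref_mean = (mean V, mean M) and ref_cov = (var V, cov VM, var M) describe
    the assumed reference population on the two covariates.
    """
    svv, svm, smm = cov(sat_v, sat_v), cov(sat_v, sat_m), cov(sat_m, sat_m)
    svy, smy = cov(sat_v, ach), cov(sat_m, ach)
    # Regression weights of the Achievement Test on SAT-V and SAT-M.
    det = svv * smm - svm ** 2
    b_v = (smm * svy - svm * smy) / det
    b_m = (svv * smy - svm * svy) / det
    # Project the sample mean and variance onto the reference population.
    ref_v, ref_m = ref_mean
    rvv, rvm, rmm = ref_cov
    mu = mean(ach) + b_v * (ref_v - mean(sat_v)) + b_m * (ref_m - mean(sat_m))
    var = (cov(ach, ach) + b_v ** 2 * (rvv - svv)
           + 2 * b_v * b_m * (rvm - svm) + b_m ** 2 * (rmm - smm))
    slope = 100.0 / sqrt(var)
    return slope, 500.0 - slope * mu


# Invented sample whose SAT means (550, 550) exceed the reference means.
ach = [20, 25, 30, 35, 40, 45]
sat_v = [450, 520, 500, 580, 600, 650]
sat_m = [470, 500, 560, 540, 600, 630]
slope, intercept = scaling_line(ach, sat_v, sat_m, (500.0, 500.0),
                                (10000.0, 6000.0, 10000.0))
print(round(slope * mean(ach) + intercept))  # group mean on the scale
```

With these invented numbers the group's scaled mean falls above 500, because the sample's SAT means exceed the assumed reference means and both regression weights are positive. The weights also show how the influence of SAT-V and SAT-M depends on their relationship with the particular Achievement Test.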

In 1959, Professor Samuel Wilks of Princeton University was engaged to
review the work of equating and scaling the SAT and the Achievement Tests (see 
Wilks, 1961). The scope of his review included not only an examination of the 
particular methods in use at the time, but also an examination of the system, 
its philosophy, and its mode of implementation. One of the questions under 
consideration in his review was that of the relative emphasis to be placed on 
the efforts to perpetuate the scales of the individual Achievement Tests by 
providing undisturbed form-to-form equivalence through the equating process
versus the emphasis to be placed on the efforts to maintain the appropriate, 
up-to-date inter-relationships among the scales for the tests, through the 
scaling process. Wilks recommended that scaling should be the first order of 
business. Accordingly, a plan was instituted to rescale the tests each year, 
and to average the results of the rescaling with equating results for that 
year. This plan was implemented in 1964 and applied annually until 1972, with 
the results incorporated into the scores reported for the following year's 
cohort. The expectation was that the differences between the scaling and 
equating efforts in this time would have diminished to the vanishing point. A 
review of the rescaling efforts in 1972 revealed, however, that the scaling 
operation was not moving consistently in one direction, but fluctuated from 
one rescaling to another, sometimes by sizable amounts. Hence, from 1972 to 
1978, rescaling was carried out biennially. In 1979 and 1980, the procedure
was carried out annually. Rescaling was discontinued in 1980 because it was
thought that the available methodology probably was not providing optimal
results. Since 1980, only form-to-form equating has taken place in the
program and the current scale for the Achievement Tests is the scale defined
in 1980 when rescaling was discontinued. However, since that time, a number
of studies of the Achievement Test scaling process have been undertaken. 

In the early 1980s, H. Braun and L. R. Tucker conducted studies (see
Dorans, 1985), using both real and simulated data, that were designed to
investigate Achievement Test scaling. These studies were undertaken to gain a
better understanding of how operational decisions affected the outcomes of
scaling. The effects of changing the definitions of the hypothetical reference group
and of changing the definition of the samples of Achievement Test takers used 
for calculating the scaling equations, as well as the effects of various 
choices of covariates, were given particular attention. The results of the 
Braun and Tucker studies indicated that the choice of the reference population
affects scaling results, as does the relationship of the Achievement Test score
with the scaling covariates.

Cook, with the collaboration of Angoff and Schmitt (Cook, 1988), recently
carried out an Achievement Test scaling study. The purpose of the Cook study 
was to explore the relationships between College Board Achievement Test scores 
and potential scaling covariates for various subgroups of the test taking 
population. It was speculated that such an exploration would lead to the 
following: 

• The selection of additional scaling covariates that might provide 
improved scaling results for those tests that do not provide scores 
correlating highly with SAT-V and/or SAT-M scores; 

• An improved specification of the characteristics of the sample of 
Achievement Test examinees that are used for the scaling of the 
tests, i.e., such a specification might possibly lead to Achievement 
Test scores that show a higher correlation with selected scaling 
covariates; and

• An improved specification of the reference group (population) used
for the scaling. As Braun and Tucker pointed out, the characteristics
of the hypothetical population differentially affect the
scales of the tests. A change in specifications for this
population, from the traditionally specified SAT-V and SAT-M means
and standard deviations of 500 and 100, respectively, might possibly
provide improved scales for some of the tests.

The final goal of the Cook study was to provide several alternative
scaling models for the Achievement Tests which would be based on the empirical
evidence gathered in the study. It was intended that, in these models,
components such as scaling covariates, scaling samples, and characteristics of
the reference group would be varied. In addition, it was anticipated that not
all Achievement Tests studied would be amenable to similar treatment; and most
likely the tests would be clustered by content area and alternative models
would be specified for each cluster.

As a result of the analyses carried out by Cook, suggestions were made
for constructing scaling samples as well as the reference population. It was
also suggested that a two-stage scaling procedure be evaluated. The two-stage
scaling procedure would include a first stage which would involve scaling
tests solely on a within-cluster basis; i.e., different covariates, different
reference populations, and possibly even different sampling procedures would
be used for each cluster.

The second stage of the two-stage scaling procedure would involve taking
the results of the first-stage scaling procedure just described and following
it with a second scaling in which the scaled scores obtained from the
first-stage scalings would be used as input. It should be mentioned that Cook did
not identify, as a result of her analyses, additional covariates that could be
used in either the first or second stages of the two-stage scaling procedure
that was suggested.

PURPOSE

The purpose of this study was to experimentally evaluate the scaling 
recommendations provided by Cook (1988). The Achievement Tests were separated
into two clusters: a language test cluster and a non-language test cluster.
Tests within the non-language test cluster were then subjected to a first- 
stage scaling, using scores on SAT-V and SAT-M as covariates, where the means 
and standard deviations of the reference population for SAT-V and SAT-M were 
empirically determined using combined data from all the samples taking each of 
the non-language tests. For the language tests, a similar procedure, but 
making use of semesters of study in addition to SAT-V and SAT-M scores, was 
implemented; again, reference group means and standard deviations for SAT-V, 
SAT-M, and semesters of study were empirically determined from the combined
data from all samples taking each of the language tests. 

The language tests were also subjected to an additional first-stage 
scaling procedure. This procedure can be thought of as an alternative to the 
procedure specified above for the first-stage scaling of the language cluster. 
The procedure consisted of scaling all of the language tests to the French 
Test, using SAT-V and -M scaled scores and semesters of study as covariates. 

The French Test was chosen as the base test not only because it has been, 
until recently, the largest volume language Achievement Test, but also because 
of certain properties of the French Test scale, i.e., for a number of French 
Test forms, the maximum raw score produces a scaled score of 800; this is true 
to a lesser extent for the other language Achievement Tests. In addition, 
scores on the French Test correlate more highly with SAT-V and -M scores than 
do scores on the other language tests except for Latin. 




Scaled scores resulting from the one first-stage scaling applied to the 
non-language tests and the two first-stage scalings of the language tests were
then subjected to second-stage scalings, using scores on SAT-V and SAT-M as
covariates, but using empirically derived reference group values based on data
from the combined sample derived from the individual samples for each language
and non-language Achievement Test included in the study.

The results of the experimental first- and second-stage scalings were 
then compared to the results of two single-stage scaling procedures. One 
procedure used empirically determined estimates for the reference group values 
based on the combined sample of all examinees from each of the language and 
non-language test samples, i.e., the same reference group used for the
second-stage scaling described above. The second procedure used 500 and 100 as
reference population values for SAT-V and SAT-M. Hence, essentially all that 
differed between the two single-stage scalings were the reference group values 
used in the scaling equations. 

METHODOLOGY 
Description of the Tests 
The thirteen² Achievement Tests used in the study fall into five general
subject areas:

English
    English Composition (two versions: all multiple-choice, and
    multiple-choice with essay)
    Literature

Foreign Languages
    French
    German
    Latin
    Spanish

History and Social Studies
    American History and Social Studies
    European History and World Cultures

Mathematics
    Mathematics Level I
    Mathematics Level II

Sciences
    Biology
    Chemistry
    Physics

²The Achievement Test in Hebrew was excluded from this study because, at the
time of the study, the test was undergoing redevelopment to make it more relevant
to the current Hebrew test-taking population.

All the Achievement Tests take one hour of testing time, and consist 
entirely of multiple-choice questions except the English Composition Test with 
Essay, which consists of a 20-minute essay and 40 minutes of multiple-choice
questions. The tests vary in content as well as in the number of multiple- 
choice items they contain. The approximate number of questions contained in 
each test is listed in the table below. 

Test                                     Approximate Number of Questions

English Composition with Essay           70 multiple-choice items plus one essay
English Composition                      85
Literature                               60
French                                   85
German                                   80
Latin                                    70
Spanish                                  85
American History and Social Studies      95
European History and World Cultures      95
Mathematics Level I                      50
Mathematics Level II                     50
Biology                                  95
Chemistry                                85
Physics                                  75

Scores for all Achievement Tests are reported on scales that range from 200 to
800.

The thirteen Achievement Tests were split into two clusters for certain
of the scalings. The two clusters are:

Language Test Cluster 

French 
Spanish 
German 
Latin 



Non- lang ua ge Test Cluster 

English^ 
Literature 

American History and Social Studies 

European History and World Cultures 

Mathematics Level I 

Mathematics Level II 

Biology 

Chemistry 

Physics 



Description of the Samples 

The Achievement Test data used to provide a base for the scalings in this 

study were obtained from the following Achievement Test administrations: 

May 1987 
June 1987^ 
November 1987 
December 1987^ 
January 1988 

For each of the administrations, examinees were selected as follows: 

May 1987: All juniors taking an Achievement Test (or Tests) 

June 1987: For all tests except Biology and Chemistry- -all juniors 

taking the Achievement Test (or Tests) 

For Biology and Chemistry- -ail juniors and all sophomores 
taking the Achievement Test (or Tests) ^ 



^The English Composition with Essay and the all-objective English 
Composition tests are placed on the same score scale via the score equating 
process. Scaled scores for both tests were used interchangeably in this study. 

^The small volume tests, European History and World Cultures, German, and 
Latin are offered only at the December and June administrations. 

^Sophomores were included in the scaling samples for Biology and Chemistry 
only for the experimental scalings carried out in this study in an attempt to 
produce improved scaling results. For the two single-stage scalings, sophomores 
were not included in the scaling samples. 
November 1987: All seniors taking the Achievement Test (or Tests)

December 1987: All seniors taking the Achievement Test (or Tests) 

January 1988: All seniors taking the Achievement Test (or Tests) 

Examinees needed to have SAT-V and SAT-M scores from at least one of the 

following seven administrations to be included in the samples. 

April 1987 November 1987 

May 1987 December 1987 

June 1987 January 1988 

October 1987 

It should be noted that examinees cannot, at present, take the SAT and the
Achievement Tests at the same administration date.

Examinees taking the French, German, Spanish, and Latin tests were 
included in the sample only if they responded to the question on the 
background questionnaire having to do with semesters of study. Table 1 
provides the background questionnaire which was in use with the French Test at 
the time data were collected for this study. The same questionnaire, with the 
appropriate name change, is used with the German and Spanish Tests while a 
very similar questionnaire is used with Latin. To be included in the sample 
for the French, German, and Spanish Tests, examinees had to have marked one of 
responses 3-8 to the background question. For the Latin Test, examinees had 
to have marked one of responses 2-8 to the background question for that test. 
For use in the scaling equations, French, German, and Spanish background
responses 3, 4, 5, 6, 7, and 8 were recoded as 4, 5, 6, 7, 8, and 9 semesters
of study, respectively. For Latin, background responses 2, 3, 4, 5, 6, 7, and
8 were recoded as 3, 4, 5, 6, 7, 8, and 9 semesters of study, respectively.
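
The inclusion and recoding rules just described can be sketched in a few
lines of Python. This is only an illustration of the rule as stated in the
text; the function name `recode_semesters` and the use of `None` to mark an
excluded examinee are assumptions made for the example.

```python
# Illustrative sketch of the background-response recoding described above.
# The function name and the None-for-excluded convention are hypothetical.

def recode_semesters(response, test):
    """Return semesters of study for a background response, or None if the
    examinee would be excluded from the scaling sample."""
    # Latin accepts responses 2-8; French, German, and Spanish accept 3-8.
    lowest = 2 if test == "Latin" else 3
    if lowest <= response <= 8:
        # Each accepted response is recoded as response + 1 semesters,
        # e.g. response 3 -> 4 semesters, response 8 -> 9 semesters.
        return response + 1
    return None
```

For example, a French examinee who marked response 2 would be dropped, while
a Latin examinee who marked response 2 would be recoded as 3 semesters.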

Insert Table 1 about here 
Scalings

All scalings in this study were carried out using the Tucker equations for
two-anchor and three-anchor scalings described in detail by Angoff (1984,
pp. 108-111 and pp. 132-133). These scalings differed, however, in how
reference group means, variances, and covariances of SAT scores and semesters
of study were established and in the number of stages used in the scalings --
single-stage or two-stage scalings.

The Tucker two-anchor scaling equations for estimating scaled score means
(M) and scaled score standard deviations (S) are:

$$\hat{M}_x = M_x + b_{vx}(\mu_v - M_v) + b_{mx}(\mu_m - M_m) \tag{1}$$

$$\hat{S}^2_x = S^2_x + b^2_{vx}(\sigma^2_v - S^2_v) + b^2_{mx}(\sigma^2_m - S^2_m) + 2b_{vx}b_{mx}(\sigma_{vm} - C_{vm}); \tag{2}$$

and the Tucker three-anchor scaling equations are:

$$\hat{M}_x = M_x + b_{vx}(\mu_v - M_v) + b_{mx}(\mu_m - M_m) + b_{sx}(\mu_s - M_s) \tag{3}$$

$$\begin{aligned}
\hat{S}^2_x = S^2_x &+ b^2_{vx}(\sigma^2_v - S^2_v) + b^2_{mx}(\sigma^2_m - S^2_m) + b^2_{sx}(\sigma^2_s - S^2_s) \\
&+ 2b_{vx}b_{mx}(\sigma_{vm} - C_{vm}) + 2b_{vx}b_{sx}(\sigma_{vs} - C_{vs}) + 2b_{mx}b_{sx}(\sigma_{ms} - C_{ms});
\end{aligned} \tag{4}$$

where $M$, $S^2$, $C$, and $b$ represent the observed mean, variance,
covariance, and partial regression coefficient, respectively, of the
subscripted variables; $\mu$, $\sigma^2$, and $\sigma$ (with two subscripts)
represent the reference group mean, variance, and covariance, respectively,
of the subscripted variables; $\hat{M}_x$ and $\hat{S}^2_x$ are the resulting
estimates of the Achievement Test scaled score mean and variance; and x, v,
m, and s represent Achievement Test scaled scores, SAT-V scaled scores, SAT-M
scaled scores, and semesters of study, respectively.

Once estimates of Achievement Test means and standard deviations have
been obtained using the two- or three-anchor scaling equations, these
estimates are used to obtain linear scaling parameters as follows:

$$(X - \hat{M}_x)/\hat{S}_x = (T - 500)/100,$$

which yields a linear equation of the form T = AX + B, where

$$A = 100/\hat{S}_x \quad\text{and}\quad B = 500 - A\hat{M}_x. \tag{5}$$
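
As a rough computational sketch of the Tucker equations and the linear
parameters of Equation (5), the following Python code treats the two- and
three-anchor cases uniformly by working with lists of anchor statistics. It
is an illustration under stated assumptions, not the program used in the
study: the function names (`solve`, `tucker_scaling`, `linear_parameters`)
are hypothetical, the partial regression coefficients are assumed to be the
ordinary least-squares coefficients from regressing the Achievement Test on
the anchors, and the example statistics at the bottom are made up.

```python
# Sketch of Tucker two-/three-anchor scaling; all names are illustrative.

def solve(a, y):
    """Solve the linear system a * b = y by Gauss-Jordan elimination."""
    n = len(y)
    m = [[float(v) for v in row] + [float(t)] for row, t in zip(a, y)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * w for u, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def tucker_scaling(mean_x, var_x, cov_zx, mean_z, cov_zz, ref_mean, ref_cov):
    """Return the estimated (mean, standard deviation) of test x.

    mean_z / cov_zz : observed anchor means and covariance matrix
    cov_zx          : observed covariances of each anchor with x
    ref_mean/ref_cov: reference group anchor means and covariance matrix
    """
    b = solve(cov_zz, cov_zx)  # partial regression coefficients (assumed OLS)
    k = len(b)
    est_mean = mean_x + sum(b[i] * (ref_mean[i] - mean_z[i]) for i in range(k))
    # The double sum yields the squared terms (i == j) and, because the
    # matrices are symmetric, twice each cross-product term (i != j).
    est_var = var_x + sum(
        b[i] * b[j] * (ref_cov[i][j] - cov_zz[i][j])
        for i in range(k) for j in range(k)
    )
    return est_mean, est_var ** 0.5

def linear_parameters(est_mean, est_sd, target_mean=500.0, target_sd=100.0):
    """Slope A and intercept B of T = A*X + B, as in Equation (5)."""
    a = target_sd / est_sd
    return a, target_mean - a * est_mean

# Hypothetical two-anchor example (SAT-V, SAT-M) with a 500/500/100/100/6,000
# reference group; the observed statistics below are invented for illustration.
est_mean, est_sd = tucker_scaling(
    mean_x=550.0, var_x=95.0 ** 2, cov_zx=[6500.0, 5500.0],
    mean_z=[540.0, 580.0], cov_zz=[[10500.0, 6300.0], [6300.0, 10800.0]],
    ref_mean=[500.0, 500.0], ref_cov=[[10000.0, 6000.0], [6000.0, 10000.0]],
)
A, B = linear_parameters(est_mean, est_sd)
```

Passing three anchors (SAT-V, SAT-M, and semesters of study) instead of two
gives the three-anchor case without any change to the code.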

Single-Stage Scalings 

Equations (1), (2), and (5) were used for the single-stage scalings of
the nine non-language Achievement Tests, with the reference group values for
$\mu_v$, $\mu_m$, $\sigma_v$, $\sigma_m$, and $\sigma_{vm}$ set equal to 500,
500, 100, 100, and 6,000, respectively.

Equations (3), (4), and (5) were used for the single-stage scalings of
the four language Achievement Tests. The reference group values for $\mu_v$,
$\mu_m$, $\sigma_v$, $\sigma_m$, and $\sigma_{vm}$ were again set equal to
500, 500, 100, 100, and 6,000, respectively. The reference group values for
$\mu_s$, $\sigma_s$, $\sigma_{vs}$, and $\sigma_{ms}$ were set equal to the
corresponding observed values in the combined sample of all language
test-takers formed by pooling the four language samples.

Empirical Single-Stage Scalings 

The empirical single-stage scalings of the thirteen Achievement Tests
mirrored the single-stage scalings except that the reference group values for
$\mu_v$, $\mu_m$, $\sigma_v$, $\sigma_m$, and $\sigma_{vm}$ for each test were
set equal to the corresponding observed values in the combined sample of all
test-takers, formed by pooling all thirteen samples. As with the single-stage
scaling for the language tests, $\sigma_s$, $\sigma_{vs}$, and $\sigma_{ms}$
were set equal to the corresponding values in the combined sample of all
language test-takers formed by pooling the four language samples.
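
The pooling step that produces empirical reference group values can be
sketched as follows. This is a minimal illustration, assuming that each
sample is available as a list of (SAT-V, SAT-M) score pairs and that
population-style moments (dividing by N) are wanted; the text does not state
which divisor was used or how examinees appearing in more than one sample
were handled, and the function name is hypothetical.

```python
# Sketch: pool several scaling samples and compute reference group values.
# Names and the divide-by-N convention are assumptions for illustration.

def pooled_reference_values(samples):
    """samples: list of samples, each a list of (sat_v, sat_m) pairs."""
    pooled = [pair for sample in samples for pair in sample]
    n = len(pooled)
    mean_v = sum(v for v, _ in pooled) / n
    mean_m = sum(m for _, m in pooled) / n
    var_v = sum((v - mean_v) ** 2 for v, _ in pooled) / n
    var_m = sum((m - mean_m) ** 2 for _, m in pooled) / n
    cov_vm = sum((v - mean_v) * (m - mean_m) for v, m in pooled) / n
    return {
        "mu_v": mean_v, "mu_m": mean_m,
        "sigma_v": var_v ** 0.5, "sigma_m": var_m ** 0.5,
        "sigma_vm": cov_vm,
    }
```

Because every examinee contributes equally to the pooled moments, larger
samples carry proportionally more weight in the resulting reference values.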

First-Stage Scalings 

The first-stage scalings of each of the thirteen Achievement Tests
mirrored the empirical single-stage scalings except that the reference group
values for $\mu_v$, $\mu_m$, $\sigma_v$, $\sigma_m$, and $\sigma_{vm}$ for the
nine non-language tests were set equal to the corresponding values in the
combined sample of all non-language test-takers (formed by pooling the nine
non-language samples), and the reference group values for the four language
tests were set equal to the corresponding values in the combined sample of
all language test-takers.

First-Stage Scalings of Language Tests to French Test Scale

Equations (3) and (4) were used to perform the first-stage scaling of the
German, Latin, and Spanish Tests to the French Test scale. For each of these
three tests, the reference group values for $\mu_v$, $\mu_m$, $\mu_s$,
$\sigma_v$, $\sigma_m$, $\sigma_s$, $\sigma_{vm}$, $\sigma_{vs}$, and
$\sigma_{ms}$ were set equal to the corresponding values in the combined
sample formed by pooling the specific language sample (either German, Latin,
or Spanish) with the French sample. Linear scaling parameters for each of the
tests were then derived as follows:

$$A = S_f/\hat{S}_x \quad\text{and}\quad B = M_f - A\hat{M}_x, \tag{6}$$

where $\hat{M}_x$ and $\hat{S}_x$ are the estimates obtained from Equations
(3) and (4), and $M_f$ and $S_f$ represent the scaled score mean and standard
deviation, respectively, for the French Test sample. It should be noted that
French Test scores were not rescaled in applying this procedure.

Second-Stage Scalings 

All second-stage scalings for each of the thirteen Achievement Tests were
carried out using Equations (1), (2), and (5) applied to the scaled scores for
each test derived from the first-stage scaling. Consequently, the second-stage
scaling was done twice for the language tests, once using the first-stage
scaling results based on the empirically derived reference group (to be
referred to as Second-Stage Scaling A) and once using the first-stage scaling
to French results (to be referred to as Second-Stage Scaling B). In all
second-stage scalings, the reference group values for $\mu_v$, $\mu_m$,
$\sigma_v$, $\sigma_m$, and $\sigma_{vm}$ were set equal to the corresponding
observed values in the combined sample of all test-takers, formed by pooling
all thirteen Achievement Test samples.
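
Because both stages yield linear transformations, the net conversion from the
original scale to the second-stage scale is itself linear. The sketch below
shows that composition; it is an illustration only, with hypothetical names,
and it assumes first-stage scaled scores are passed to the second stage
without intermediate rounding or truncation.

```python
# Sketch: compose a first-stage and a second-stage linear conversion.
# Each conversion is represented as a (slope, intercept) pair.

def compose(first, second):
    """Return (A, B) such that A*x + B == second(first(x))."""
    a1, b1 = first
    a2, b2 = second
    # second(first(x)) = a2 * (a1 * x + b1) + b2
    return a2 * a1, a2 * b1 + b2

def apply(params, x):
    """Apply a (slope, intercept) conversion to a score x."""
    a, b = params
    return a * x + b
```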

Data Used in Scalings 

Tables 2 and 3 summarize the data used as input for the various scalings
performed. Table 2 contains scaled score summary statistics for the reference
groups or populations and Table 3 contains scaled score summary statistics
used for the Achievement Test scaling samples.

Insert Tables 2 and 3 about here 

As seen in Table 2 and explained previously, the reference group values
for SAT-V and SAT-M for the second stage scaling and the empirically based
single-stage scaling are identical. It is also apparent, from examination of
the reference group values for SAT-V and SAT-M given in Table 2 for the first
stage scaling of the non-language tests, that these values are very similar to
those used for the second stage scaling and the empirically based single-stage
scaling. This is because the non-language tests dominate the aggregate used
to obtain the reference group values for these two scalings, for two reasons:
1) the non-language tests outnumber the language tests (there are nine
non-language tests compared to the four foreign language tests); and 2) in
general, the non-language tests are the larger volume tests and, therefore,
have a greater impact on the aggregate statistics than do the language tests.
It should also be noted that the language and non-language test groups are
reasonably similar in ability, as measured by SAT-V and SAT-M scores.

Summary of Scalings Performed

As a result of application of the scaling procedures, linear scaling 
parameters and "new" scaled scores for each of the thirteen Achievement Tests 
were created as follows: 

1. Single-stage scaling -- a single set of linear scaling parameters for
   each of the thirteen tests.

2. Empirical single-stage scaling -- a single set of linear scaling
   parameters for each of the thirteen tests.

3. First-stage scaling -- a single set of first-stage scaling parameters
   for each of the non-language tests and two sets of first-stage
   results for each of the language tests, one set based on using an
   empirically derived reference group and the other set based on
   scaling the Spanish, German, and Latin Tests to the French Test
   scale.

4. Second-stage scaling -- a single set of second-stage scaling
   parameters for each of the non-language tests and two sets of
   second-stage results for each of the language tests.



Evaluation of Results

The results of the analyses were evaluated by comparing Achievement Test
scaled scores obtained from the application of the experimental scaling
procedures. The results were compared in several ways. First, the results
were compared to determine if the rank orderings of the Achievement Test
scaled score means were similar to what was to be expected, given the ability
levels (as measured by SAT-V and SAT-M scaled scores) of the groups taking the
tests.

Second, the results of the experimental scalings were evaluated by
examining the Achievement Test scaled score means obtained from application of
the experimental scaling parameters after conditioning on SAT-V and SAT-M
scaled scores and semesters of study. The assumption underlying the
conditioning procedure is that if the scales of the Achievement Tests are
aligned, scaled score means on the tests will be somewhat similar for groups
at the same ability level, as measured by the scaling covariates.

Results of the study were also evaluated by examining the relationship 
between Achievement Test scaled score means for pairs of Achievement Tests 
where each pair was taken by some reasonably sized group of examinees. Again, 
the assumption was that if the test scales were aligned, the scaled score 
means obtained for each pair of tests taken by each group of examinees would 
be reasonably similar. 

RESULTS 

The results of the analyses conducted for this study are summarized in
Tables 4-7 and Figures 1a-3d. The information provided in Table 4 summarizes
the results of applying the linear parameters obtained from the four
experimental scaling procedures used for the non-language tests and the six
experimental procedures used for the language tests to values at fifty point
intervals on the current Achievement Test scales, which range from 200 to 800.
The scale values in the left column are labeled current scale and indicate the
existing scale values for each of the tests prior to application of any of the
experimental scaling results.
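
A conversion table of the kind just described can be generated by applying
each procedure's linear parameters to the current scale values. The sketch
below is illustrative only: the function name and the dictionary layout are
assumptions, and no rounding or truncation to the 200-800 range is applied
because the text does not say how the tabled values were treated.

```python
# Sketch: apply (A, B) parameters to current scale values at 50-point steps.
# Names are hypothetical; the parameters would come from the scalings.

def conversion_table(procedures, low=200, high=800, step=50):
    """procedures: dict mapping procedure name -> (A, B).

    Returns one row per current scale value: the current value followed by
    the converted value for each procedure, in dictionary order."""
    rows = []
    for current in range(low, high + 1, step):
        converted = [a * current + b for a, b in procedures.values()]
        rows.append((current, *converted))
    return rows
```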

Insert Table 4 about here 
Examination of the data presented for the English Composition Test in
Table 4 indicates that the results of the first and second stage scalings and 
the empirically based single-stage scaling are almost identical. All three of 
these scaling procedures result in scaled scores that are somewhat lower than 
scaled scores on the current scale. 

Only the single-stage scaling results shown in Table 4 for the English
Composition Test provide scaled scores that differ from those provided by the
other three experimental procedures. The major difference between this and
the other procedures is the way in which the reference population is defined.
For the single-stage scaling procedure, the reference population is defined to
have scaled score means of 500 and scaled score standard deviations of 100 on
both SAT-V and SAT-M. For the other three procedures, empirical values
obtained by aggregating across samples taking the actual tests were used.

The results shown in Table 4 for the Literature Test, American History 
Test, and European History Test are quite similar to those obtained for the 
English Composition Test. For all of these tests, the results of the first 
and second stage scalings and the empirically based single-stage scaling are 
quite similar. For all tests, the results of these three procedures provide 
scaled scores that are similar and somewhat lower than those provided by the 
single-stage scaling procedure. As was the case for the English Composition 
Test, the results of the single-stage procedure applied to the American 
History and European History Tests provided scaled scores somewhat higher at 
the upper end of the score range and somewhat lower at the lower end of the 
score range than those associated with the current scale. For the Literature 
Test, scaled scores obtained from application of the single-stage procedure 
were almost identical to scores on the current scale. 
The results of the experimental scalings of the Mathematics Level I and
II Tests, displayed in Table 4, are somewhat different from those obtained for 
the previously discussed tests in that the single-stage scaling for the Level 
II Test does not provide results that are similar to those obtained for the 
other tests. As can be seen, results obtained for the Mathematics Level I 
Test are similar to the other tests evaluated so far in that the scaled scores 
from the single-stage results are somewhat higher at the upper end of the 
score range and lower at the lower end of the range than scaled scores on the 
current scale. On the other hand, the single-stage results obtained for the 
Mathematics Level II Test are consistently lower than the current scale
throughout the entire scaled score range.

Examination of the information provided in Table 4 for the Biology, 
Chemistry, and Physics Tests shows that the scaled scores provided by all the 
experimental procedures, with the exception of the single-stage scaling 
procedure, are similar to those obtained for the other tests discussed so far. 
The single-stage scaling procedure used with the Biology Test provides scaled 
scores that are somewhat lower at the top and higher at the bottom of the 
scale score range when compared to scaled scores on the current scale. 
Results of the single-stage scaling procedure used with the Chemistry and 
Physics Tests provide scaled scores that are slightly higher at the top of the 
scaled score range and also somewhat higher at the bottom of the range when 
compared with scaled scores on the current scale. 

The scaling results for the French Test provided in Table 4 indicate that 
four of the experimental procedures, first stage scaling, second stage 
scalings A and B, and the empirically based single-stage scaling, all yield 
quite similar results. The results of these four procedures all provide 
scores that are lower throughout the scaled score range than scaled scores on
the current scale. The results of scaling to the French Test are, of course, 
identical to the current scale. The results of the single-stage scaling 
provide scaled scores that are similar in the upper end of the score range and 
somewhat higher in the lower end of the score range than scaled scores on the 
current scale. 

The results for the experimental scaling procedures used with the German 
Test, which are summarized in Table 4, indicate that the two second stage 
scaling procedures and the empirically based single-stage procedure all 
provide results that are quite similar. These procedures provide scaled 
scores that are somewhat lower than scaled scores on the current scale. 
Scaling to the French Test and single-stage scaling provide fairly similar
results when applied to the German Test. Both of these procedures provide
results that are slightly higher at the top of the scaled score range and
somewhat lower at the bottom of the range than scaled scores on the current
scale.

The Latin Test results presented in Table 4 show that the two second
stage scaling procedures provide almost identical results; i.e., scaled
scores that are lower at the top of the scale and almost the same at the
bottom of the scale as scaled scores on the current scale. In addition, the
results of the first stage scaling procedure and the empirically based
single-stage scaling procedure are very similar, providing scaled scores that
are lower at both the top and bottom of the scaled score range than scaled
scores on the current scale. The results of scaling to the French Test and the
single-stage scaling procedures are different from each other and from the
results of the other procedures. The results of scaling to the French Test
provide scaled scores that are lower at the bottom and the top than scaled
scores on the current scale. The results of the single-stage scaling
procedure provide scores that are somewhat lower at the top and higher at the
bottom than scaled scores on the current scale.

Data provided in Table 4 for the Spanish Test indicate that the results 
of all the scaling procedures, with the exception of scaling to the French 
Test and single-stage scaling, are quite similar to each other. These 
procedures all have a tendency to provide scaled scores that are lower at both
the top and bottom when compared with scaled scores on the current scale. The
results of scaling to the French Test and single-stage scaling are similar in 
that both the procedures provide scaled scores that are higher at the top and 
lower at the bottom than scaled scores on the current scale. 

The effect of the various scaling procedures on the summary statistics 
provided for the thirteen Achievement Tests used in this study can be seen by 
examining the information provided in Table 5. Table 5 presents the 
Achievement Test scaled score summary statistics resulting from application of 
each of the experimental scaling procedures as well as the summary statistics 
and correlations of the scaling covariates, SAT-V and SAT-M, and semesters of 
foreign language study (for the language tests), with Achievement Test scores. 

Insert Table 5 about here 

It is clear from examination of the information provided in Table 5 for
the English Composition Test that application of the first and second stage
scaling results and the empirically based single-stage scaling procedure
results in similar scaled score summary statistics. Application of the results
of the single-stage scaling procedure provides scaled score summary statistics
that are quite different from those obtained by the other experimental
procedures. It is clear that none of the experimental procedures results in
scaled score means that are similar to those obtained using the current scale
values for the test.

Information provided in Table 5 for the Literature Test shows close
agreement among summary statistics obtained by application of the results of 
the first and second stage scalings and the empirically based single-stage 
scaling procedure. In addition, scaling results obtained by application of 
the single-stage scaling procedure agree almost perfectly with summary 
statistics obtained using current scale values. 

Results of the experimental scalings of the American History and European 
History Tests, presented in Table 5, are similar to those obtained for the 
Literature Test. These results indicate a high level of agreement among 
scaled score summary statistics obtained for the first and second stage 
scalings and the empirically based single-stage scaling procedure applied to 
these two tests. As was observed for the Literature Test, results obtained 
for the single-stage scaling procedure used with the American History and 
European History Tests are very similar to summary statistics obtained using 
the current scale values. 

Information provided in Table 5 for the Math Level I and Level II Tests
is similar to that provided for the tests discussed so far in that the first
and second stage scalings and the empirically based single-stage scaling all 
provide similar scaled score summary statistics for the respective tests. On 
the other hand, the results for both of these tests are similar to those 
obtained for the English Composition Test in that the summary statistics 
obtained as a result of applying the single-stage scaling procedure differ 
somewhat from those associated with the current scale values. 
The results provided in Table 5 for the Biology, Chemistry, and Physics
Tests show that the summary statistics resulting from the application of the 
first and second stage scalings and the empirically based single-stage scaling 
are similar for each of the respective tests. For the Biology Test, the 
summary statistics resulting from application of the single-stage scaling 
procedure agree quite closely with those resulting from the current scale 
values. For both the Chemistry and Physics Tests, summary statistics 
resulting from the application of the single-stage scaling procedure are 
somewhat different from those obtained using current scale values. 

Examination of the results presented in Table 5 for the French Test
indicates close agreement among the summary statistics resulting from
application of the first stage scaling and the second stage scalings A and B.
Of course, the summary statistics obtained from scaling to the French Test are 
identical to the current scale values. The summary statistics provided by the 
single-stage scaling results and the empirically based single-stage scaling 
results differ from each other and also from the current scale values. 

The results presented in Table 5 for the German Test are inconsistent 
with those obtained for the French Test. Summary statistics obtained for the 
two second stage scalings agree quite closely with each other. Summary 
statistics obtained by scaling to the French Test and from the single-stage 
scaling procedure also agree closely with each other and are reasonably close 
to the summary statistics derived from the current scale. 

The results of the Latin Test scalings presented in Table 5 indicate that 
the only procedures providing similar summary statistics are the two second 
stage scaling procedures. The summary statistics resulting from the first 
stage scaling and the empirically based single-stage scaling agree somewhat 
for this test, as do the summary statistics obtained by application of the
single-stage scaling procedure and the current scale values. Summary
statistics obtained by scaling to the French Test are not in close agreement
with those obtained by any of the other experimental scaling procedures.

The Spanish Test scaling results presented in Table 5 indicate reasonably 
close agreement among the summary statistics obtained for the first and second 
stage scalings and the empirically based single-stage scaling. Summary 
statistics resulting from scaling to the French Test and the single-stage 
scaling procedure are in reasonably close agreement and agree fairly well with 
summary statistics obtained using the current scale values. 

Table 5 also presents the correlation coefficients of Achievement Test
scores with the scaling covariates, which should be noted at this point. It
can be seen, from examination of the information presented in Table 5, that
the thirteen tests generally fall into three categories: 1) tests that
correlate highly with SAT-V scores; 2) tests that correlate highly with SAT-M
scores; and 3) tests that do not correlate highly with either SAT-V or SAT-M
scores.

Tests such as English Composition, American History, Literature, and
European History all show a stronger relationship between Achievement Test
scores and SAT-V scores than between Achievement Test scores and SAT-M
scores. Tests showing a higher correlation of Achievement Test scores with
SAT-M scores than with SAT-V scores are the two Math tests and the Physics
and Chemistry Tests. The Biology Test scores, unlike the other science test
scores, show a slightly higher relationship with SAT-V scores than with SAT-M
scores. Scores on the foreign language tests do not correlate particularly
well with either SAT-V or SAT-M scores. The language test scores that show
the highest relationship with scores on SAT-V and SAT-M are the Latin Test
scores. The language test whose scores exhibit the lowest correlations with
SAT-V and SAT-M scores is the German Test.

One way to evaluate the results of the experimental scaling procedures is 
to evaluate the rank ordering of the scaled score means obtained for the 
groups actually taking the tests in relationship to the groups' ability 
levels, as assessed by the scaling covariates SAT-V and SAT-M. If the 
underlying abilities measured by the various Achievement Tests were equally 
and perfectly correlated with abilities measured by the covariates, and the 
scales of the tests were aligned, one would expect the rank ordering of the 
group means obtained on the Achievement Tests to match those obtained by the
groups on the covariate measures. As just noted, the tests differ in their 
relationship to the covariates, so an examination of the ranking of the 
Achievement Test scaled score means in relationship to SAT-V and SAT-M scaled 
score means can provide only a rough evaluation of the scaling results. 

The results of the rank ordering of the scaled score means are presented 
in Table 6. A pragmatic criterion based on a combination of SAT-V and SAT-M 
was formed, and scaled score means obtained on SAT-V and SAT-M were simply 
summed for each group taking a particular Achievement Test. This scaled score 
sum was then used to rank order the thirteen tests from high to low. Scaled 
score means obtained using current scale values and the results of first stage 
scaling, second stage scaling A, single-stage scaling, empirically based 
single-stage scaling, and scaling to the French Test were used to rank order 
the tests and these orders were then compared to the rank ordering obtained 
using the summed SAT-V and SAT-M scaled scores. It can be seen, from 
examination of the information presented in Table 6, that none of the rank 
orderings associated with the current Achievement Test scale or the
experimental scaling procedures evaluated exactly reproduce the rank ordering 
that occurs using the summed SAT-V and SAT-M means. 

Insert Table 6 about here 

Some consistency clearly does exist in the rank orderings provided in 
Table 6. For example, Math Level II and Physics are the two top ranked tests 
regardless of scaling procedure and regardless of whether Achievement Test 
score or SAT sum is used. Scaling to the French Test scale definitely 
provides a higher rank ordering for the language tests than the other scaling 
procedures under consideration. This higher ranking for the language tests is 
consistent with the high ranking these tests receive on the SAT sum. It should
also be noted that both the current scale and the second stage scaling results
preserve the rank ordering of the four language tests obtained using the sum
of SAT means. On the other hand, the remaining scaling procedures place the
French Test scaled score mean above the German Test mean, which is
inconsistent with the rank ordering obtained using the sum of SAT means.

As a means of assessing the degree of consistency between the rank
orderings of the scaled score means obtained from the five scaling procedures,
the rank ordering obtained from the current scale, and the rank ordering based
on the summed SAT-V and SAT-M means, Spearman rank order correlation
coefficients were calculated and are reported in Table 6. (Ties were resolved
by reranking using data to more decimal places than shown in Table 6.) The
rank order correlation between rank orderings based on summed SAT-V and SAT-M
means and the current scale is .687. The rank order correlations between the
single-stage scaling and the empirically based single-stage scaling and the
rank ordering based on the summed SAT-V and SAT-M means are .538 and .896,
respectively. The rank order correlations between rank orderings based on
summed SAT-V and SAT-M means and the second stage scaling and scaling to the
French Test are .929 and .926, respectively. Finally, the rank order
correlation between the first stage scaling results and the SAT-V and SAT-M
sum is .846. The experimental scaling procedures that use empirically derived
data for the reference populations appear to provide orderings of means that
are more consistent with the ordering provided by the summed SAT-V and SAT-M
means than are the orderings based on means from the current scale or the
single-stage scaling procedure.
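
The rank order comparison can be illustrated with a short Python sketch. It
assumes, as the text indicates, that ties have already been resolved, so the
simple difference-based form of the Spearman coefficient applies; the
function names are hypothetical and no values from Table 6 are reproduced.

```python
# Sketch: Spearman rank order correlation for untied data.
# Inputs would be, e.g., the summed SAT-V and SAT-M means for the thirteen
# groups and the corresponding Achievement Test scaled score means.

def ranks_high_to_low(values):
    """Rank 1 goes to the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(x, y):
    """Spearman coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks_high_to_low(x), ranks_high_to_low(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

With thirteen tests, a single swap of two adjacent ranks lowers the
coefficient only slightly, which is one reason the coefficient offers a
coarse summary of agreement rather than a detailed one.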

It is also interesting to note that the current scale and the single-stage
scaling procedures have a tendency to rank tests whose scores show a strong
relationship to SAT-M scores higher than tests whose scores have a strong
relationship with SAT-V scores. This is not surprising given that the current
SAT-V and SAT-M scales are not well aligned and the SAT-M scale is higher than
the SAT-V scale. There are, however, some exceptions to this rule,
particularly the Latin Test. Another point worth noting is that all rank
orderings, with the exception of that provided by the current scale, rank
Math Level I as the lowest ranked test. Finally, if the language tests are
ignored and only the rank orderings of the non-language tests are evaluated,
it can be seen that the current scale, the second stage scaling A, the first
stage scaling, the empirically based single-stage scaling, and the scaling to
the French Test (which provides almost the same results as the second stage
scaling) result in a rank ordering that is the same and consistent with the
ranking obtained using the SAT sum for the Math Level II, Physics, Chemistry,
and European History Tests. On the other hand, ignoring the language tests,
the experimental procedures result in a fairly different rank ordering of the
Math Level I, Biology, Literature, American History, and English Composition
Tests. The current scale and the single-stage scaling procedures have a
tendency to place Math I higher and the Literature and English tests lower
than either the SAT sum or the remaining scaling results.

One way to evaluate the alignment of the Achievement Test scales is to 
examine plots of Achievement Test scaled score means, conditioned on the three 
covariates used for the experimental scalings. Recall a definition of scaling 
given by Keeves (1988) in an earlier section of this paper. Keeves quoted 
Howard (1958) as saying: "In practice these requirements [key requirements of
a scaling procedure] demand that the same mark on different examinations 
should imply the same level of performance relative to a common population." 
The common, or reference population, used for the various experimental scaling 
procedures differed from procedure to procedure, as is illustrated by the 
information presented in Table 2; consequently, it was thought useful to 
evaluate the relationship among scaled score means obtained for the 
Achievement Tests at 100-point intervals along the SAT-V and SAT-M score
scales. In addition, Achievement Test means were evaluated by examining
foreign language test means conditioned on semesters of study of a foreign
language, which ranged from three to nine semesters.
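
The conditional means described above can be illustrated with a short sketch. The Python below is not the procedure used in the study. It assumes that examinees are grouped into 100-point bands of a covariate by floor division (the report does not say how the groups were formed), and the function names and records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def conditional_means(records, width=100):
    """Mean scaled score for each (test, covariate band).

    records: iterable of (test_name, covariate_score, scaled_score).
    A band is labeled by its lower bound, e.g. 500 covers 500 to 599.
    """
    groups = defaultdict(list)
    for test, covariate, scaled in records:
        band = (covariate // width) * width
        groups[(test, band)].append(scaled)
    return {key: mean(vals) for key, vals in groups.items()}

def spread_at(cond_means, band):
    """Range (max minus min) of the test means within one covariate band."""
    vals = [m for (_, b), m in cond_means.items() if b == band]
    return max(vals) - min(vals)

# Hypothetical records, not data from the study.
records = [
    ("Test A", 500, 510), ("Test A", 520, 530), ("Test A", 600, 590),
    ("Test B", 500, 540), ("Test B", 610, 650), ("Test B", 650, 630),
]
cm = conditional_means(records)
print(cm[("Test A", 500)], spread_at(cm, 500), spread_at(cm, 600))
```

For these made-up records the spread among test means is 20 points in the 500 band and 50 points in the 600 band. A smaller spread at a given covariate value is what the plots would show as tighter clustering.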

The results of these analyses are plotted in Figures 1a-1o and Figures
2a-2e. Figures 1a-1o contain, for the current scale and each of the
experimental scaling methods, plots of Achievement Test scaled score means for 
groups of examinees with selected scores on the scaling covariates. Three 
plots appear for each scaling procedure. One plot shows Achievement Test
means conditioned on SAT-V scores, the second plot shows Achievement Test
means conditioned on SAT-M scores, and the third plot shows language
Achievement Test means conditioned on semesters of study of a foreign
language.

Figures 2a-2e are simply a rearrangement of the plots shown in Figures
1a-1o; i.e., Figures 2a-2e show plots for all three covariates used for a
single scaling procedure on the same page and hence are more useful for
evaluating trends across the covariates. Because the plots provided in
Figures 1a-1o are larger, they permit a clearer evaluation of the behavior of
the individual tests represented by the symbols in the plots and, hence, are
included in the paper.

Insert Figures 1a-1o and Figures 2a-2e about here

Figures 1a-1c show plots of Achievement Test means on the current scale
conditioned on SAT-V, SAT-M, and semesters of study of a foreign language,
respectively. Examination of the information provided in Figure 1a indicates
a considerable spread among all the conditional means, although the spread 
appears to be less in the vicinity of an SAT-V scaled score of 500. If one 
looks only at the grouping of Achievement Test means at an SAT-V mean of 500, 
it is apparent that the Math Level I, Chemistry and Physics Tests form one 
cluster of scores, that the remaining tests, with the exception of Math Level 
II, form a second cluster of means, and that the Math Level II Test provides 
higher scaled scores than any of the other Achievement Tests. The plots
provided in Figure 1b show somewhat similar results to those given in the
first figure. It can be seen that the Achievement Test means tend to cluster
at a scaled score mean of 500 on the SAT-M scale. Again, the Math Level II
mean scores appear higher than mean scores obtained on the other Achievement
Tests. Figure 1c shows the relationship among language test means for the
four foreign language tests conditioned on semesters of study. It is 
apparent, even after conditioning on amount of training, that there is still 
considerable variability in mean scores, with the Spanish Test consistently 
providing the lowest scores and the Latin Test, the highest. 

The results of the single-stage scaling procedure presented in Figures
1d-1f can be contrasted with results shown in the previously discussed plots.
It appears that the Achievement Test conditional means resulting from the
single-stage scalings are slightly less dispersed than those observed for the
current scale for all three covariates. A major difference between the 
information provided in these plots and those provided for the current scale 
is that the Math Level II means, conditioned on SAT-M scores, do not appear as 
high as the means shown for the same test resulting from the current scale. 

Figures 1g-1i show conditional Achievement Test scaled score means
resulting from the first stage scalings. It should be kept in mind that the
reference populations are now centered at means that are above 500 on SAT-V
and SAT-M (see Table 2). An examination of the information provided in Figure 1g
shows that Achievement Test means conditioned on SAT-V means are fairly 
tightly clustered for mid to upper SAT-V scaled score ranges. Math Level II 
means appear to be more closely related to the means obtained on the other 
tests for this particular scaling procedure than observed for the Level II 
conditional means for the scaling procedures previously evaluated. 
Conditional means provided in Figure 1h, i.e., Achievement Test means
conditioned on SAT-M scores, appear to provide a slightly tighter clustering
of Achievement Test means than the clustering observed in the previously
discussed plots that involved conditioning on SAT-M. The plots shown in
Figure 1i, which display language test means resulting from the first stage
scaling conditioned on semesters of study, also indicate a closer agreement 
among these means, particularly for seven semesters of study (the reference 
population mean) than the comparable plots previously evaluated. 

Figures 1j-1l show Achievement Test conditional means resulting from
scaling to the French Test. Achievement Test means shown in these plots are
for the foreign language tests only. A comparison of the foreign language
test means conditioned on SAT-V and SAT-M scaled scores (presented in Figures
1j and 1k) shows the means to be reasonably clustered at different points on
the V and M scaled score continuums. A comparison of the information shown in
Figure 1l with that shown in Figures 1c, 1f, and 1i indicates that scaling to
the French Test has a tendency to cluster foreign language test means, 
conditioned on semesters of study, a little more tightly. 

Figures 1m-1o show Achievement Test conditional means resulting from the
second stage scalings. The plots shown in Figures 1m-1o are almost identical
to those shown for the first stage scaling results. The tighter clustering of
conditional Achievement Test means, as compared to those obtained by the
single-stage scaling or those for the current scale, is evident. In addition,
Math Level II scores resulting from the second stage scaling seem to provide 
conditional means closer to the other Achievement Test means than the Level II 
means that resulted from the single-stage scaling or the current scale. 

As mentioned previously, Figures 2a-2e contain the same plots illustrated
in Figures 1a-1o; however, the plots shown in Figures 2a-2e have been
condensed so that plots of conditional means for all covariates used for a
particular scaling method can be shown on a single page. Figure 2a contains
plots of conditional means obtained using current scale values. It can be
seen, from an examination of the plots contained in Figure 2a, that there is
considerable scatter of Achievement Test means conditioned on SAT-V and SAT-M
scores, as well as for the language test means conditioned on semesters of
study.

Examination of the information provided in Figure 2b, which illustrates 
conditional means resulting from the single-stage procedure, shows a slight
reduction in the scatter of the Achievement Test means compared to those
obtained using current scale values, at least for those conditioned on SAT-V
and SAT-M scores. Figure 2c contains plots of Achievement Test means
resulting from the first stage scaling procedure. This scaling procedure
appears to have resulted in a noticeable reduction in the scatter of the
Achievement Test means for all three covariates.

A comparison of the plots shown in Figure 2d with the bottom panel of
Figure 2c permits a comparison of the results of scaling the language tests to 
the French Test scale with those obtained by the alternative first stage 
scaling procedure. It appears that the scatter of foreign language 
Achievement Test means, conditioned on semesters of study, is quite similar 
for the two scaling procedures. 

Finally, the plots shown in Figure 2e illustrate conditional Achievement 
Test means resulting from the second stage scaling. The degree of scatter 
observed in Figure 2e for the Achievement Test means is very similar to that 
observed for the first stage scaling results. 

Another way to evaluate the alignment of the Achievement Test scales 
resulting from application of the experimental scaling procedures is to 
examine the relationship between scaled score means on pairs of Achievement
Tests, where each pair was taken by the same group of students. The
assumption is that, if the students were equally prepared in the subject 
matter tested by each test in the pair, and if the tests were measuring the 
same underlying ability, tests with aligned scales would show similar mean 
scores. Figures 3a-3d provide bivariate plots of Achievement Test scaled
score means resulting from the experimental scaling procedures. The data used
to provide the scaled score values plotted in Figures 3a-3d were obtained from
a recent administration of the Achievement Tests and are not necessarily 
representative of the groups of students used for the experimental scalings. 



Insert Figures 3a-3d about here

The plots shown in Figures 3a-3d demonstrate the relationship between
pairs of Achievement Test scaled scores representing the current scale, first 
stage scaling results, single-stage scaling results and scaling to the French 
Test. (Scaled score means for the non-language tests in the scaling to the 
French Test plot were derived from the results of the first stage scaling of 
these tests.) Points falling closer to the diagonal line on the plots 
represent pairs of Achievement Test means that are in closer agreement with 
each other than pairs represented by points that are farther away from the 
diagonal line. Table 7 provides scaled score values for the points plotted in
Figures 3a-3d. In addition, Table 7 provides SAT-V and SAT-M scaled score
means for the respective groups as well as the correlation of Achievement Test
scores with SAT-V and SAT-M scores. The particular scaling results used to
obtain the mean scores for the pairs of Achievement Tests shown in Figures
3a-3d and Table 7 were chosen because they represent, to a certain extent, the
extremes of scale alignment provided by the results obtained in this study.

Insert Table 7 about here

Examination of the information provided in Figure 3a, which shows a plot 
of pairs of Achievement Test scaled score means based on the current scale, 
indicates that test pairs showing the most agreement in scaled score means are 
Biology and Chemistry, English Composition and European History, English 
Composition and Literature, English Composition and Biology, Biology and
American History, and English Composition and American History. Points
falling the farthest from the diagonal line represent the following pairs of
tests: Chemistry and Math Level I, English Composition and Physics, Latin and
Math Level I, and Latin and English Composition. The remaining points on the
plot are somewhat intermediate in agreement. 
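
Closeness to the diagonal, which these plots are read for by eye, can also be expressed as a number. The sketch below is an illustration only and was not part of the study. It assumes that agreement is summarized by the absolute difference between the two scaled score means of a pair; the test names and means are hypothetical and are not taken from Table 7.

```python
def diagonal_gaps(pair_means):
    """Sort test pairs by the absolute difference between their two means.

    pair_means maps (test_x, test_y) to (mean_x, mean_y) for the group of
    students who took both tests. A point on the diagonal has a gap of 0.
    """
    gaps = {pair: abs(mx - my) for pair, (mx, my) in pair_means.items()}
    return sorted(gaps.items(), key=lambda item: item[1])

# Hypothetical means, not values from Table 7.
pair_means = {
    ("Test A", "Test B"): (560, 565),
    ("Test A", "Test C"): (570, 620),
    ("Test B", "Test C"): (600, 580),
}
ranked = diagonal_gaps(pair_means)
for pair, gap in ranked:
    print(pair, gap)
```

The perpendicular distance of a point from the diagonal is this gap divided by the square root of 2, so sorting by either quantity gives the same order of pairs.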

The single-stage scaling results are represented by the points plotted in 
Figure 3b. It can be seen, from examination of the points plotted in this 
figure, that points falling closest to the diagonal line are those associated
with the following tests: English Composition and Biology, French and American
History, and Biology and American History. Points representing test pairs
falling the farthest from the diagonal are English Composition and French,
English Composition and Latin, English Composition and Physics, and Chemistry
and Math Level I.

Figure 3c contains a bivariate plot of scaled score means resulting from 
application of the first stage scaling procedure. It can be seen from an
examination of this figure that points falling closest to the diagonal line
represent the following test pairs: English Composition and Literature,
English Composition and Chemistry, French and American History, Biology and
Chemistry, and Biology and Math Level I. Points falling farthest away from the
diagonal line in Figure 3c are those representing English Composition and
French, English Composition and Latin, English Composition and Physics, and 
Chemistry and Math Level I. 

Finally, the results of scaling to the French Test are shown by the 
points plotted in Figure 3d. It should again be noted that values presented 
in this plot and in Table 7 for the non-language tests are those obtained from
the first stage scaling of these tests. Examination of the information 
provided in Figure 3d indicates that pairs of tests falling close to the 
diagonal line are English Composition and Literature, English Composition and 
French, English Composition and Chemistry, French and American History, Latin 
and Math Level I, and Biology and Math Level I. Pairs of tests falling the 
farthest from the diagonal are English Composition and Latin, English 
Composition and Physics, French and Math Level I, and Chemistry and Math
Level I.

While it is difficult to draw conclusions regarding the different scaling 
procedures from an examination of the data provided in the four figures, a few
generalizations can be made. For one, the scaling procedure resulting in the 
largest number of points (test pairs) falling close to the diagonal line was 
the procedure that involved scaling to the French Test scale. The procedure 
that resulted in the smallest number of pairs falling close to the diagonal line
was the single-stage scaling procedure. Secondly, it seems reasonable to 
expect tests such as, for example, English Composition and Literature to 
provide scaled score means somewhat similar for a group of examinees taking 
both tests if the test scales were aligned. For this test pair, this is 
clearly the case for the means resulting from the current scale and from all 
experimental scaling procedures with the exception of the single-stage scaling 
procedure.

Three Achievement Test pairs provided results that were consistently
discrepant regardless of whether the current scale or an experimental scaling 
procedure was used. These test pairs were English Composition and Latin, 
English Composition and Physics, and Chemistry and Math Level I. One would 
hardly expect the English Composition and Physics Tests to provide scores 
measuring a single underlying skill or ability and hence, similar scaled score 
means for the same group of students. However, it might be expected that 
tests such as Chemistry and Math Level I share a sufficiently common base of 
knowledge or skills that the test scores obtained on these tests by the same 
group of students should be somewhat similar. Examination of the correlation 
coefficients supplied in Table 7 for this pair of tests indicates that, 
although the tests are reasonably correlated with each other, they show a 
differential correlation with the scaling covariates. Math Level I is much 
more highly correlated with SAT-M scores than are scores obtained on the Chemistry
Test. This differential correlation clearly appears to be reflected in the 
pair of Achievement Test scaled score means and is an indication that even 
when Achievement Tests share some commonality in the skills they measure, they 
may not share a similar relationship with the scaling covariates, thus 
complicating even further the scaling process. 

DISCUSSION AND CONCLUSIONS

The purpose of this study was to evaluate an experimental scaling
procedure suggested in an earlier study by Cook (1988). The proposed
experimental scaling method involved the use of a two-stage scaling procedure.
The first stage attempted to scale tests by cluster (language versus
non-language tests) and to maximize the alignment of scales of tests within a
cluster. The second stage of the procedure was carried out in an attempt to
align scales for tests across clusters. The first stage scalings of the tests
in the language test cluster were based on two different procedures. One 
procedure used, as reference group values for the scaling covariates, data 
aggregated across all the language tests in the cluster. The alternative 
first stage scaling procedure consisted of scaling the German, Latin, and 
Spanish Tests to the French Test scale. In addition to the two-stage scaling
procedure, which was the focus of this study, a single-stage procedure and an
empirically based single-stage procedure were also used for comparative
purposes.

Finally, the effect of changing the procedure used to select scaling 
samples for the Biology and Chemistry Tests was evaluated. The first and 
second stage experimental scalings included high school sophomores in the 
scaling samples for these two tests, while the two single-stage scaling 
procedures did not. 

The results of the two-stage procedure were somewhat disappointing in
that, in almost all cases, the results of the first stage scaling (scaling 
within cluster) and the results of the second stage scaling (scaling across 
clusters) were very similar. The reason for this is fairly clear. If one 
examines the information provided in Table 2, which shows values of scaling 
covariates used for the reference groups for the first and second stage 
scalings, it is apparent that these values change very little from first to 
second stage and hence have little effect on the scalings. The only exception 
to this statement is the reference group mean of SAT-V scores obtained for the 
language test cluster. 

As mentioned previously, the reference group values change so little
from the first to the second stage scaling for two reasons. First, the
groups that take the tests (language versus non-language) are similar in
ability as assessed by SAT-V and SAT-M scores. Second, for the second stage
scalings, the non-language group, because it contains more tests and the tests
are given to more students, has a greater influence on the reference group
values determined by aggregating across clusters.

Although the first stage results derived from the separate scalings of 
the language and non-language tests did not differ much from the second stage
scaling results, the results of scaling the language tests to the French Test 
scale did provide results differing from second stage scaling B results. The 
interesting thing about these results is that the second stage scaling B 
results, the second stage scaling A results, and the first stage scaling 
results all agree fairly well with each other and disagree with the results
obtained by scaling the language tests to the French Test scale. 

A second aspect of the study, varying the manner in which the scaling 
samples are selected for the Biology and Chemistry Tests by including high 
school sophomores in the scaling samples, can be evaluated by examining the 
information provided in Table 3. Perusal of this information indicates that 
the addition of sophomores to the samples had very little effect on the 
summary statistics for the covariates (SAT-V and SAT-M scores) used in the 
scaling procedures. The addition of sophomores to the scaling sample did, 
however, affect the summary statistics obtained for the respective Achievement 
Tests. Also, it should be noted that addition of sophomores to the scaling 
sample for the Biology Test had a tendency to lower the correlations between 
the scores on the two covariates and Achievement Test score, while addition
of sophomores to the scaling sample for the Chemistry Test had no effect on
these correlations.

A useful framework for evaluating the results of the current study is
provided by statements made by both Angoff (1968) and Keeves (1988) about the 
desired outcomes of the scaling process. Recall that Angoff stated that the
desired outcome of a scaling procedure is to ensure that candidates choosing to
compete with more able candidates will not be put at a disadvantage.
According to Angoff, "...a candidate who is average in a highly selected group
of candidates should earn a higher scaled score than a candidate who is
average in a less able group." According to Keeves (1988), one of the key
requirements of moderation [scaling] procedures is to ensure that "...the
same mark on different examinations should imply the same level of performance
relative to a common population." With Angoff's and Keeves' statements of
desired outcomes of the study in mind, it is useful to focus on the
information provided in Tables 6 and 7 and Figures 1a-1o, 2a-2e, and 3a-3d.

The information provided in Table 6 is an attempt to use Angoff's
definition of a desired outcome of a scaling method to evaluate the
experimental scaling procedures. Using Angoff's definition, if the covariates
used for the scalings are measures of the abilities underlying scores on the 
thirteen tests and the covariates are related to the tests in a similar 
manner, one would expect the rank ordering of groups taking the tests to be 
similar to a ranking obtained using scores on these covariates. An 
examination of the information provided in Table 6 shows that the scaling 
procedures do affect the rank orderings of the tests and that the procedures
employing reference group values that are empirically based provide rank
orderings that are more similar, when compared to rankings obtained using the
sum of SAT-V and SAT-M means, than are rank orderings produced by the current
scale and by the single-stage scaling procedure which employs a reference
group with a mean and standard deviation of SAT-V and SAT-M scores of 500 and
100, respectively. The information provided in Table 6 indicates that the 
rank order correlation coefficient is somewhat affected by whether or not 
results from the first stage scaling procedure are used or results provided by 
the additional scaling across the language and non-language clusters 
represented by the second stage scaling procedure are used. It appears as 
though closer agreement with the ranking obtained on the sum of SAT-V and 
SAT-M means is realized by Achievement Test means resulting from the second 
stage scaling. 
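
The rank order comparison discussed above can be sketched as follows. This is an illustration under stated assumptions, not the computation reported in Table 6: it assumes Spearman's coefficient for untied ranks (the report does not say which rank order coefficient was used), and all means shown are hypothetical.

```python
def ranks(values):
    """Rank values from highest (1) to lowest; assumes no tied values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rank-order correlation for two untied lists of equal length."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical group means for five tests, not values from Table 6.
sat_sum     = [1210, 1150, 1100, 1080, 1040]  # sum of SAT-V and SAT-M means
scaling_one = [640, 600, 590, 560, 540]       # same order as sat_sum
scaling_two = [600, 640, 560, 590, 540]       # two adjacent pairs swapped

print(spearman(sat_sum, scaling_one))
print(spearman(sat_sum, scaling_two))
```

With these made-up values the first scaling reproduces the SAT-sum ordering exactly (coefficient of 1.0), while the second, with two adjacent pairs of tests swapped, gives a coefficient of about 0.8. Small differences between means can therefore move the coefficient noticeably when only a handful of tests are ranked.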

Although the effect on rank order correlation coefficients of the various 
scaling procedures is apparent, interpretation of these effects is not so 
clear. Recall that the assumption is that the scaling covariates, SAT-V and
SAT-M, measure the same underlying abilities as measured by the thirteen
Achievement Tests used in this study and that the relationship of the
covariates with the Achievement Tests is similar across tests. Dramatic
evidence that this assumption is not true is presented by the correlation 
coefficients given in Table 5. As noted previously, the thirteen tests show 
very different patterns of correlations with the covariates. In addition, the 
results of the rankings of the Achievement Tests must be interpreted with 
caution since in a number of instances very small differences between 
Achievement Test means result in different rank orderings of the tests. 
Keeping the above mentioned caveats in mind, if one were to use agreement in 
rank ordering of means between a particular scaling procedure and the 
covariates as a criterion in the choice of a scaling procedure, it appears 
that either the second stage scaling procedure used with all tests or the 
empirical single-stage scaling used for the non-language tests coupled with
the second stage scaling used with the language tests would be the procedures
of choice.

A second way to evaluate the experimental scaling procedures used for 
this study is to employ Keeves' key requirement for scaling; i.e., that 
scores on the Achievement Tests should imply the same level of performance 
relative to a common population. The best information to use to evaluate the 
tests from this point of view is the information provided in the plots shown 
in Figures 2a-2e. An examination of the plots shown in these figures
indicates that both the first stage scaling and second stage scaling 
procedures reduce the scatter of Achievement Test conditional means when 
compared to that observed for the single-stage procedure and the current 
scale. For all of the procedures represented in the plots shown in Figures 
2a-2e, the scatter of Achievement Test conditional means is less as one 
approaches the value of the covariate mean of the reference population. 

The fact that scatter of the conditional Achievement Test means plotted 
for extreme values of the covariates is much greater than that observed for 
values surrounding the reference population mean is understandable given that 
the experimental scaling procedures are all linear scaling procedures. Some 
type of non-linear scaling procedure would necessarily need to be used to
maintain a similar level of agreement among Achievement Test conditional means
throughout the entire range of covariate scores.
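
One way to see why linear procedures behave this way is the simplified sketch below. It is not the scaling model used in the study. It assumes a single covariate, and it assumes that after scaling each test's conditional mean is a straight line through a common scaled score at the reference population mean, with a slope that differs from test to test. The slopes, reference values, and names are hypothetical.

```python
def conditional_mean(slope, covariate, ref_mean=500.0, ref_scaled=500.0):
    """Scaled score mean predicted at a covariate value by a straight line
    that passes through the point (ref_mean, ref_scaled)."""
    return ref_scaled + slope * (covariate - ref_mean)

# Hypothetical slopes of scaled score on the covariate for two tests.
slopes = {"Test A": 0.9, "Test B": 0.6}

def gap(covariate):
    """Difference between the highest and lowest test means at one value."""
    vals = [conditional_mean(s, covariate) for s in slopes.values()]
    return max(vals) - min(vals)

for c in (300, 500, 700):
    print(c, gap(c))
```

Under these assumptions the two tests agree exactly at the reference mean and differ by about 60 scaled score points at covariate values 200 points away in either direction. A straight-line transformation can match the tests at one point but cannot remove a difference in slope, which is consistent with the greater scatter seen at extreme covariate values.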

It is important to note again that one would not expect exact agreement
among Achievement Test means conditioned on a particular covariate value 
unless the covariate is measuring the same underlying ability for all the 
tests and is related to the tests in a perfect manner. As mentioned 
previously, there is clear evidence given in Table 5 that this condition is
not met. However, keeping the data presented in Table 5 in mind, one could
interpret the results provided in Figures 2a-2e as providing some indication
that the first and second stage scaling procedures are somewhat more 
successful in aligning Achievement Test means than is a procedure that is 
based on a reference population with scaled score means and standard 
deviations of 500 and 100 on SAT-V and SAT-M. 

The information regarding the relationship of Achievement Test means for 
groups taking pairs of the tests, presented in Figures 3a-3d and Table 7, is 
very difficult to interpret. This may be because the covariate means of the 
particular group taking a pair of Achievement Tests are very different from 
those specified for the reference groups used for the scaling procedures as 
well as being very different from pair to pair. As mentioned previously, it 
appears as though the procedures based on empirically derived scores on the 
covariates for the reference populations result in pairs that are in slightly 
closer agreement than a procedure that involves using values of 500 and 100 
for the reference population scaled score means and standard deviations.

The question must be asked, even if scales for, say, the Biology Test and 
Math Level I Test were perfectly aligned, given that the tests measure 
different skills and are selected by examinees for different reasons, should 
one expect scores on these tests for the same group of examinees to be the 
same? The answer to this question is probably no. Thus, examining the 
relationship among means obtained on pairs of Achievement Test scores is 
probably not the most effective way of choosing one scaling procedure from a 
group of potential scaling procedures. 

An important point to note is the dramatic effect that the use of 
empirical values for reference group scores on the scaling covariates has on
the placement of the Achievement Tests on the 200 to 800 scale. Achievement
Test scaled score means obtained for the experimental procedures employing 
empirical values for the reference group and Achievement Test means obtained 
employing the current scale are very different (see Table 5). Also, the 
relationship between scaled score means obtained on SAT-V and SAT-M and the 
Achievement Tests, for the respective groups taking the tests, is quite 
different depending upon whether or not empirical values are used for 
reference group covariate scores. 

Examination of the information provided in Table 5 indicates that use of 
empirical values for the reference group results in a much lower placement on
the 200 to 800 scale for Achievement Test means relative to the current scale. 
This is due to the fact that the empirically defined reference population is a 
very able group, as assessed by SAT-V and SAT-M scores. Estimating scores for 
this group on the respective Achievement Tests and subsequently placing these 
scores on the current scale results in the low scale placement of Achievement 
Test means observed in Table 5 for the empirically based scaling procedures. 
It should be recalled, however, that the primary purpose of the scaling 
procedures evaluated in this study is to promote comparability of scales for 
the thirteen Achievement Tests, not necessarily to promote comparability of 
Achievement Test scales with the current scale. Thus, lower scale placement 
of scores due to the use of observed reference group values is not necessarily 
a disadvantage of the empirically based procedures.

To summarize, the purpose of this study was to evaluate whether an 
experimental scaling procedure that employed observed reference group values 
for the covariates (SAT-V and SAT-M) would provide better results than a 
procedure that used as reference group values scaled score means of 500 and
standard deviations of 100. In addition, the feasibility of scaling in two
stages was evaluated. The purpose of the first stage scaling was to align
scales within clusters of tests; i.e., language versus non-language
Achievement Test clusters, and the purpose of the second stage scaling was to 
align score scales across the two clusters. Finally, the effect of augmenting 
the scaling samples used for the Biology and Chemistry Tests by adding high 
school sophomores to these samples was evaluated. 

The results of the study related to the sampling question indicate that 
addition of high school sophomores to the samples does not improve the 
relationship between Achievement Test scores and the scaling covariates (at 
least as evaluated by the correlations between Achievement Test score and the 
covariates) and thus is probably not an appropriate change to consider. The
results of the study related to the investigation of the two-stage scaling
procedure indicate there is some reasonable evidence to suggest that the use 
of empirical values for the reference group and use of the procedure that 
involves scaling the tests in two stages may improve the alignment of the 
Achievement Test scales. A viable alternative involves application of the 
empirical single-stage scaling to the non-language tests and the two-stage 
scaling to the language tests. This combination of procedures appears to
provide a degree of alignment of the scales comparable to that provided by
applying the two-stage procedure to all tests, and should be somewhat easier
to implement. 

The results of the study should be interpreted with caution because of 
the circular nature of the criterion. In other words, the requirements of the 
scaling were specified, a scaling procedure was developed based on these 
requirements, and the criterion used to evaluate the results of the scalings
was based on the same requirements. As mentioned several times, the scaling
procedures, as well as the criterion used to evaluate the procedures, were 
based on an untenable assumption; i.e., that the covariates measured the same 
underlying abilities for all the tests and that the relationship between the 
covariates and the Achievement Test was similar for all tests. This 
assumption is clearly impossible to meet. Given this situation, the best that 
can be expected is a rough alignment of the test scales. This rough alignment 
does appear to be provided by the two-stage procedure evaluated in this study.

It is important that the results of this study be considered tentative 
and that further research be carried out. Procedures that involve modeling 
and correcting for the selection bias present in the Achievement Test scores 
might prove fruitful. In addition, nonlinear scaling procedures, which
should improve the alignment of scores throughout the entire scaled score
range of the tests, might also be desirable and should be investigated
further.
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Table 1 

Achievement Test Background Questionnaire Used
to Collect Covariate Information

French Test

In the group of nine spaces labeled Q, you are to blacken ONE and ONLY ONE space,
as described below, to indicate how you obtained your knowledge of French. The
information that you provide is for statistical purposes only and will not
influence your score on the test.

Question 1

If your knowledge of French does not come primarily from courses taken in
grades 9 through 12, blacken space 9 and leave the remaining spaces blank,
regardless of how long you studied the subject in school. For example, you are
to blacken space 9 if your knowledge of French comes primarily from any of the
following sources: study prior to the ninth grade, courses taken at a college,
or special study, residence abroad, or living in a home in which French is
spoken.

If your knowledge of French does come primarily from courses taken in grades 9
through 12, blacken the space that indicates the level of the French course in
which you are currently enrolled. If you are not now enrolled in a French
course, blacken the space that indicates the level of the most advanced course
in French that you have completed.

Level I:    first or second half    - blacken space 1
Level II:   first half              - blacken space 2
            second half             - blacken space 3
Level III:  first half              - blacken space 4
            second half             - blacken space 5
Level IV:   first half              - blacken space 6
            second half             - blacken space 7
Advanced Placement or course
that represents a level of
study higher than Level IV:
second half                         - blacken space 8

If you are in doubt about whether to mark space 9 rather than one of the spaces
1-8, mark space 9.

Note: The same questionnaire (with the appropriate test name) appears in the
French, German, Latin and Spanish Tests. The Latin questionnaire differs
slightly in that the phrase, "...or living in a home in which [language] is
spoken" is eliminated.
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Table 4 

Results of Application of Experimental Scaling 
Procedures to Selected Scale Score Points 



ENGLISH COMPOSITION

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          781            782             825             781
  750          731            732             773             731
  700          681            681             720             681
  650          631            631             668             631
  600          581            581             615             581
  550          531            531             563             531
  500          481            481             511             481
  450          431            431             458             430
  400          381            380             406             380
  350          331            330             354             330
  300          281            280             301             280
  250          231            230             249             230
  200          181            180             196             180


LITERATURE

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          773            773             801             773
  750          725            724             751             724
  700          676            675             701             675
  650          627            627             651             627
  600          578            578             601             578
  550          530            529             551             529
  500          481            480             501             480
  450          432            431             451             431
  400          383            382             401             382
  350          335            334             351             334
  300          286            285             301             285
  250          237            236             251             236
  200          189            187             201             187
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Table 4 (cont.) 

Results of Application of Experimental Scaling 
Procedures to Selected Scale Score Points 



AMERICAN HISTORY

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          774            774             811             774
  750          723            723             759             723
  700          672            672             707             672
  650          621            621             655             621
  600          571            570             604             570
  550          520            519             552             519
  500          469            468             500             468
  450          418            417             448             417
  400          367            366             396             366
  350          316            315             345             315
  300          265            264             293             264
  250          214            213             241             213
  200          163            162             189             162


EUROPEAN HISTORY

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          787            787             818             786
  750          734            734             765             734
  700          682            682             711             682
  650          630            629             658             629
  600          577            577             604             577
  550          525            524             551             524
  500          473            472             497             472
  450          420            420             444             420
  400          368            367             390             367
  350          316            315             337             315
  300          263            262             283             262
  250          211            210             230             210
  200          159            157             176             157
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Table 4 (cont.) 

Results of Application of Experimental Scaling 
Procedures to Selected Scale Score Points 



MATHEMATICS LEVEL I

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          752            753             835             753
  750          699            700             780             700
  700          647            647             725             647
  650          594            595             671             595
  600          541            542             616             542
  550          489            489             561             489
  500          436            436             506             436
  450          383            383             452             383
  400          331            331             397             331
  350          278            278             342             278
  300          226            225             288             225
  250          173            172             233             172
  200          120            119             178             119


MATHEMATICS LEVEL II

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          699            700             773             700
  750          648            648             722             648
  700          596            597             670             597
  650          545            545             619             546
  600          494            494             568             494
  550          443            443             516             443
  500          391            391             465             391
  450          340            340             413             340
  400          289            288             362             288
  350          237            237             310             237
  300          186            185             259             185
  250          135            134             207             134
  200           83             82             156              83
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Table 4 (cont.) 

Results of Application of Experimental Scaling 
Procedures to Selected Scale Score Points 



BIOLOGY

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          744            744             792             744
  750          697            697             744             697
  700          650            649             696             650
  650          602            602             648             602
  600          555            555             599             555
  550          508            507             551             507
  500          460            460             503             460
  450          413            413             454             413
  400          366            365             406             365
  350          319            318             358             318
  300          271            270             309             270
  250          224            223             261             223
  200          177            176             213             176


CHEMISTRY

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          744            745             801             745
  750          696            697             753             697
  700          648            648             704             649
  650          600            600             656             600
  600          552            552             608             552
  550          504            504             560             504
  500          456            456             511             456
  450          408            408             463             408
  400          360            360             415             360
  350          312            312             366             312
  300          264            264             318             264
  250          216            216             270             216
  200          168            168             221             168
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Table 4 (cont.) 

Results of Application of Experimental Scaling 
Procedures to Selected Scale Score Points 



PHYSICS

                                                           Emp. Based
Current    First Stage    Second Stage    Single-Stage    Single-Stage
 Scale       Scaling        Scaling         Scaling         Scaling

  800          743            744             806             744
  750          696            696             757             696
  700          648            648             708             648
  650          600            601             659             601
  600          553            553             610             553
  550          505            505             561             505
  500          457            457             511             457
  450          410            410             462             410
  400          362            362             413             362
  350          314            314             364             314
  300          267            266             315             266
  250          219            219             266             219
  200          172            171             216             171



[Table 4 (cont.): two further pages of this table were printed in landscape
orientation and are not legible in this copy.]
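
Each column of Table 4 appears to be a linear conversion of the current scale. As an informal check, and not part of the original analysis, the sketch below recovers a slope and intercept from the 200 and 800 rows of the English Composition Single-Stage Scaling column and measures how far the remaining rows fall from that line. The function name is hypothetical, and because the tabled values are rounded the recovered parameters are only approximate.

```python
def line_through(x0, y0, x1, y1):
    """Slope and intercept of the straight line through two points."""
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return slope, intercept


# English Composition rows of Table 4: Current Scale and Single-Stage Scaling.
current = [800, 750, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200]
single_stage = [825, 773, 720, 668, 615, 563, 511, 458, 406, 354, 301, 249, 196]

# Estimate the conversion from the lowest and highest tabled points.
slope, intercept = line_through(current[-1], single_stage[-1],
                                current[0], single_stage[0])

# Largest distance between the fitted line and any tabled value.
max_gap = max(abs(slope * x + intercept - y)
              for x, y in zip(current, single_stage))

print(f"slope={slope:.4f} intercept={intercept:.2f} max gap={max_gap:.2f}")
```

Every tabled value lies within one scale point of the fitted line, which is what rounding of a linear conversion would produce.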




Table 5 

Summary Statistics Resulting from 
Application of Experimental Scaling Parameters 



ENGLISH COMPOSITION (n=216,735)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      518        499         499          530          499         514     576
s.d.       99         99         100          104          100         101     102

r(ACH,SATV)   .78
r(ACH,SATM)   .55


LITERATURE (n=25,006)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      528        508         507          529          507         527     545
s.d.      103        101         101          103          101         103     102

r(ACH,SATV)   .83
r(ACH,SATM)   .53
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Table 5 (cont . ) 

Summary Statistics Resulting from
Application of Experimental Scaling Parameters


AMERICAN HISTORY (n=47,639)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      528        497         496          529          496         515     557
s.d.       97         99          99          100           99          99     102

r(ACH,SATV)   .75
r(ACH,SATM)   .55


EUROPEAN HISTORY (n=3,785)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      547        522         521          547          521         554     562
s.d.       95        100         100          102          100         102     103

r(ACH,SATV)   .67
r(ACH,SATM)   .44
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Table 5 (cont.) 

Summary Statistics Resulting from 
Application of Experimental Scaling Parameters 



MATH I (n=155,671)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      543        481         481          553          481         496     557
s.d.       90         95          95           99           95          98      95

r(ACH,SATV)   .48
r(ACH,SATM)   .82


MATH II (n=54,787)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      660        555         556          629          556         545     646
s.d.       85         87          87           87           87         105      82

r(ACH,SATV)   .43
r(ACH,SATM)   .78
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Table 5 (cont.) 

Summary Statistics Resulting from
Application of Experimental Scaling Parameters


BIOLOGY (n=23,634)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      540        498         497          541          497         514     573
s.d.      107        101         101          103          101         104     102

r(ACH,SATV)   .70
r(ACH,SATM)   .62


CHEMISTRY (n=29,238)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      572        525         525          581          525         525     624
s.d.      102         98          98           98           98         108      93

r(ACH,SATV)   .58
r(ACH,SATM)   .65



Table 5 (cont.) 

Summary Statistics Resulting from 
Application of Experimental Scaling Parameters 



PHYSICS (n=18,415)

                       Achievement Test Information
                                                        Emp. Based
        Current   First Stg   Second Stg   Single-Stg   Single-Stg
         Scale     Scaling     Scaling      Scaling      Scaling      SAT-V   SAT-M

Mean      594        547         547          604          547         536     653
s.d.       97         93          93           96           93         108      84

r(ACH,SATV)   .51
r(ACH,SATM)   .64
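
If each experimental scale is a linear conversion of the current scale, the means and standard deviations in Table 5 should follow from the current-scale statistics. The sketch below is an informal consistency check for English Composition, not a step of the original study: it estimates each conversion from the 200 and 800 rows of Table 4 and applies it to the current-scale mean (518) and s.d. (99). The function names are hypothetical, and the estimated parameters are approximate because the tabled values are rounded.

```python
def linear_from_rows(x_low, y_low, x_high, y_high):
    """Slope and intercept of the conversion through two tabled points."""
    slope = (y_high - y_low) / (x_high - x_low)
    return slope, y_low - slope * x_low


def transform_mean_sd(mean, sd, slope, intercept):
    """Mean and s.d. of slope * X + intercept, given the mean and s.d. of X."""
    return slope * mean + intercept, abs(slope) * sd


# English Composition, current scale (Table 5).
current_mean, current_sd = 518, 99

# Table 4 rows (current 200 -> y_low, current 800 -> y_high) and the
# corresponding mean and s.d. reported in Table 5.
conversions = {
    "First Stage": ((200, 181, 800, 781), (499, 99)),
    "Single-Stage": ((200, 196, 800, 825), (530, 104)),
}

gaps = {}
for name, (rows, (reported_mean, reported_sd)) in conversions.items():
    slope, intercept = linear_from_rows(*rows)
    mean, sd = transform_mean_sd(current_mean, current_sd, slope, intercept)
    gaps[name] = (abs(mean - reported_mean), abs(sd - reported_sd))
    print(f"{name}: predicted mean {mean:.1f} (reported {reported_mean}), "
          f"predicted s.d. {sd:.1f} (reported {reported_sd})")
```

For both conversions the predicted values fall within one scale point of the reported ones, which is consistent with rounding in the two tables.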



[The material on pages 66-69 of the report was printed in landscape
orientation and is not legible in this copy.]



[Figures, pages 70-84: plots with the vertical axis "Achievement Test Scaled
Score"; the captions and plotted values are not legible in this copy.]



[Figure: panels plotted against SAT-V Scaled Score and SAT-M Scaled Score.
Legend: EN, LR, AH, EH, M1, M2, BY, CH, PH, FR, GM, LT, SP.]

Figure 2a:

Figure 2b: Comparison of Achievement Test conditional scaled score means
for three scaling covariates.



[Figure: panels plotted against SAT-V Scaled Score, SAT-M Scaled Score, and
Semesters of Study; panel label "First-Stage Scale". Legend: EN, LR, AH, EH,
M1, M2, BY, CH, PH, FR, GM, LT, SP.]

Figure 2c: Comparison of Achievement Test conditional scaled score means
for three scaling covariates.

Figure 2d: Comparison of Achievement Test conditional scaled score means
for three scaling covariates.

Figure 2e: Comparison of Achievement Test conditional scaled score means
for three scaling covariates.
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